Noise dressing of the correlation matrix of factor models

We study the spectral density of factor models of multivariate time series. Using
Random Matrix Theory, we analytically quantify the effect of noise dressing on the
spectral density due to the finiteness of the sample. We consider a broad range of
models, from one factor models in the time and frequency domains to hierarchical
multifactor models.

The extraction of information from a multivariate time
series is a central issue in many applications. Several
methods have been introduced to this end, ranging from
principal component analysis to clustering methods [1,2].
The simplest and most widespread models of multivariate time series are factor
models. In these models the dynamics of each variable is a linear combination of a
given number of factors plus an idiosyncratic noise term. The coefficients of the
linear combination and the intensity of the noise terms are specific to each variable
and are assumed for simplicity to be time independent. Examples of such models are
the Capital Asset Pricing Model or CAPM (one factor) and Arbitrage Pricing Theory
(multifactor) in the financial domain [3]. Another class of factor models describes
the dynamics of variables driven by one or more sinusoidal signals of given
frequency, with each variable characterized by a different phase. This kind of model
has recently been applied to the analysis of gene expression during the cell cycle
as measured with microarrays [4,5]. In this second case the variables follow the
common factor(s) in the frequency domain rather than in real time.
The most general multifactor model for $N$ variables $x_i(t)$ ($i = 1, \ldots, N$)
can be written as

$$x_i(t) = \sum_{j=1}^{K} \gamma_i^{(j)} f_j(t) + \gamma_i^{(0)} \epsilon_i(t). \qquad (1)$$

In this equation $K$ is the number of factors $f_j(t)$, $\gamma_i^{(j)}$ is a constant
describing the weight of factor $j$ in explaining the dynamics of the variable $x_i$,
and $\epsilon_i(t)$ is a Gaussian zero mean noise term with unit variance. In Eq. (1)
we assume that the factors are uncorrelated with each other, i.e.
$\langle f_i(t) f_j(t) \rangle = \delta_{ij}$, where the symbol $\langle \ldots \rangle$
indicates an average in time. The noise terms are also uncorrelated with each other
and with the factors, i.e. $\langle \epsilon_i(t) \epsilon_j(t) \rangle = \delta_{ij}$
and $\langle f_i(t) \epsilon_j(t) \rangle = 0$. Since in the rest of this paper we are
interested in the linear correlation coefficients, we assume without loss of
generality that all the variables $x_i$ have zero mean and unit variance. These
assumptions fix the value $(\gamma_i^{(0)})^2 = 1 - \sum_{j=1}^{K} (\gamma_i^{(j)})^2$.
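
The model just described (a linear combination of $K$ unit variance factors plus an
idiosyncratic noise term whose weight is fixed by the unit variance condition) can
be simulated directly. The following Python sketch is only an illustration: the
names `simulate_factor_model` and `loadings` are hypothetical, the factors are
assumed to be Gaussian (the text only requires them to be uncorrelated), and the
numerical values are arbitrary.

```python
import numpy as np

def simulate_factor_model(loadings, T, rng):
    """Draw T records of x_i(t) = sum_j g_ij f_j(t) + g_i0 eps_i(t).

    loadings: (N, K) array of weights gamma_i^(j). Each row must satisfy
    sum_j gamma_i^(j)^2 <= 1 so that the noise weight gamma_i^(0) is real.
    """
    loadings = np.asarray(loadings, dtype=float)
    N, K = loadings.shape
    explained = np.sum(loadings**2, axis=1)
    if np.any(explained > 1.0):
        raise ValueError("sum_j gamma_i^(j)^2 must not exceed 1")
    noise_weight = np.sqrt(1.0 - explained)   # gamma_i^(0), fixed by unit variance
    factors = rng.standard_normal((K, T))     # assumption: Gaussian factors
    noise = rng.standard_normal((N, T))       # idiosyncratic Gaussian terms
    return loadings @ factors + noise_weight[:, None] * noise

rng = np.random.default_rng(0)
loadings = np.array([[0.6, 0.0], [0.5, 0.3], [0.0, 0.7], [0.2, 0.2]])
x = simulate_factor_model(loadings, T=20000, rng=rng)

sample_corr = np.corrcoef(x)                  # finite sample estimate
model_corr = loadings @ loadings.T            # sum_k gamma_i^(k) gamma_j^(k)
np.fill_diagonal(model_corr, 1.0)
```

For long samples `sample_corr` approaches `model_corr`; the difference between the
two at finite `T` is the noise dressing studied in the rest of the letter.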

Multivariate methods are designed to extract information on the number of factors
and on the composition of the groups. On the other hand, any real experiment is
performed on a finite sample of $T$ records for each variable, and the measured
quantities are unavoidably dressed by noise. In this letter we study the role of
noise in dressing the spectral density of the correlation matrix. The correlation
matrix $C$ is the $N \times N$ symmetric matrix whose entries are the linear
correlation coefficients between each pair of variables $x_i$ and $x_j$. The object
of our study is the spectral density of the sample correlation matrix $C$. The
square roots of the eigenvalues of $C$ are called singular values. We will make use
of Random Matrix Theory [6] to compute the effect of noise dressing on the spectrum
of correlation matrices of factor models. Most of the results we derive are valid in
the limit $N \to \infty$ and $T \to \infty$, even though the approximation is quite
good for real large matrices. The application of Random Matrix Theory to the
noise dressing of correlation matrices has recently been addressed [?] and applied
to the study of financial correlation matrices [?]. In Ref. [?] the null hypothesis
against which real sample correlation matrices are compared is a model of
uncorrelated variables, or in our language a zero factor model or random model. In
the random model each variable is described only by a random Gaussian variable
$\epsilon_i(t)$. The corresponding class of random matrices is called Wishart
matrices in the statistical literature. The correlation matrix is the identity
matrix and the spectral density is $\rho(\lambda) = N\delta(\lambda - 1)$. The noise
dressing of the spectrum of the sample correlation matrix has been derived in [?].
In the limit $T, N \to \infty$, with a fixed ratio $Q = T/N > 1$, the spectral
density of the correlation matrix is given by

$$\rho(\lambda) = \frac{NQ}{2\pi} \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda}, \qquad (2)$$

where $\lambda_{\min}^{\max} = 1 + 1/Q \pm 2\sqrt{1/Q}$. Since the procedure used to
obtain this result is very similar to the one we use below, we summarize it briefly
here. One introduces the resolvent

$$\mathcal{G}(z) = \mathrm{Tr}\, \frac{1}{z - C}, \qquad (3)$$

which is related to the spectral density through

$$\rho(\lambda) = \frac{1}{\pi} \lim_{\epsilon \to 0} \mathrm{Im}\left[ \mathcal{G}(\lambda - i\epsilon) \right]. \qquad (4)$$

The resolvent is equal to $\mathcal{G}(z) = \partial_z \ln\det(z - C)$. By making use
of the replica trick and by performing a saddle point approximation one finds an
equation for the ensemble average of the resolvent
$G(z) = \langle \mathcal{G}(z) \rangle_{ens}$, and through Eq. (4) the spectral
density is obtained. The symbol $\langle \ldots \rangle_{ens}$ indicates an average
over the ensemble of variables.
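
The dressing of the random model can be checked numerically. The sketch below
assumes the standard Marchenko-Pastur form of the density, here normalized to one
rather than to $N$, with edges $1 + 1/Q \pm 2\sqrt{1/Q}$; the function name
`marchenko_pastur` and the values of `N` and `T` are illustrative choices, not taken
from the text.

```python
import numpy as np

def marchenko_pastur(lam, Q):
    """Marchenko-Pastur density (unit normalization) for unit variance data."""
    lam = np.asarray(lam, dtype=float)
    lam_min = 1.0 + 1.0 / Q - 2.0 * np.sqrt(1.0 / Q)
    lam_max = 1.0 + 1.0 / Q + 2.0 * np.sqrt(1.0 / Q)
    inside = (lam > lam_min) & (lam < lam_max)
    rho = np.zeros_like(lam)
    rho[inside] = (Q / (2.0 * np.pi)
                   * np.sqrt((lam_max - lam[inside]) * (lam[inside] - lam_min))
                   / lam[inside])
    return rho, lam_min, lam_max

# Random (zero factor) model: N uncorrelated Gaussian variables, T records each.
N, T = 200, 1000
rng = np.random.default_rng(1)
x = rng.standard_normal((N, T))
eigenvalues = np.linalg.eigvalsh(np.corrcoef(x))

# Trapezoidal check that the density integrates to one.
grid = np.linspace(0.0, 3.0, 6001)
rho, lam_min, lam_max = marchenko_pastur(grid, Q=T / N)
mass = np.sum(0.5 * (rho[1:] + rho[:-1]) * np.diff(grid))
```

A histogram of `eigenvalues` can then be compared with `rho`; at finite `N` a few
eigenvalues may fall slightly outside `[lam_min, lam_max]`.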

Although the random model can sometimes be a useful starting null hypothesis, a
factor model should be used as the null hypothesis when one suspects the presence of
common factors in the dynamics of an ensemble of variables. As a first example we
consider the one factor model, in which the dynamics of each variable is controlled
by a single common factor. The one factor model is described by Eq. (1) with
$K = 1$. The parameter $\gamma_i^2$ gives the fraction of variance explained by the
common factor $f(t)$. It is straightforward to show that the correlation coefficient
between variables $i$ and $j$ is $\gamma_i \gamma_j$. The correlation matrix of the
one factor model can therefore be written as $C = \Lambda + b b^{+}$, where
$\Lambda = \mathrm{diag}(1 - \gamma_i^2)$ is a diagonal $N \times N$ matrix and
$b^{+} = (\gamma_1, \ldots, \gamma_N)$ is a row vector. The characteristic equation
of $C$ can be calculated by using the Sherman-Morrison formula [?], and the result is

$$\prod_{i=1}^{N} (1 - \gamma_i^2 - \lambda) \left[ 1 + \sum_{j=1}^{N} \frac{\gamma_j^2}{1 - \gamma_j^2 - \lambda} \right] = 0. \qquad (5)$$

FIG. 1: (a) The low part of the spectrum of a single realization of a non degenerate
one factor model (circles). In this case $N = 2000$ and $\gamma = 0.25$. The
continuous line is the prediction based on the ansatz discussed in the text. (b)
Spectral density of a degenerate one factor model. The gray areas are the average
over 1000 numerical simulations of a one factor model of $N = 100$ variables for
$T = 500$ time steps. The value of $\gamma$ is 0.25. The dashed line is the
theoretical prediction.

In the case of a degenerate one factor model, i.e. when
$\gamma_i = \gamma$ for all values of $i$, the characteristic equation can
be solved and the spectrum is composed of a large eigenvalue
$\lambda_1 = 1 + (N - 1)\gamma^2 \simeq N\gamma^2$ and $N - 1$ degenerate eigenvalues
$\lambda_i = 1 - \gamma^2$, where $i = 2, \ldots, N$ [?]. We are not able to solve
Eq. (5) in the non degenerate case, but we can provide an approximate form of the
spectrum when $N$ is large. In the non degenerate case we still expect a spectral
density composed of a large eigenvalue (high part) and $N - 1$ small eigenvalues (low
part). The large eigenvalue can be obtained by setting to zero the term in square
brackets in Eq. (5). The largest eigenvalue is much larger than $1 - \gamma_i^2$ for
any $i$, and we can therefore approximate the characteristic equation as
$\sum_i \gamma_i^2 / \lambda_1 \simeq 1$, i.e.
$\lambda_1 \simeq \sum_{i=1}^{N} \gamma_i^2 = N \langle \gamma_i^2 \rangle_{ens}$. To
gain insight into the spectral density of the other $N - 1$ eigenvalues we assume
that the $\gamma_i$ are distributed according to a given probability density
$P(\gamma_i)$. We make the ansatz that the distribution of the $\lambda_i$ is given
by $P(\lambda) = P(\gamma_i)\,|d\gamma_i/d\lambda|$, where
$\gamma_i = \sqrt{1 - \lambda}$. The idea behind this ansatz is that the relation
between eigenvalues and $\gamma_i$ is the same as in the degenerate case. For
example, if $\gamma_i$ is distributed uniformly in a subinterval $[m - d, m + d]$ of
$[0, 1]$, the distribution of the $N - 1$ eigenvalues is given by
$P(\lambda) = (4d)^{-1} (1 - \lambda)^{-1/2}$ for
$1 - (m + d)^2 \le \lambda \le 1 - (m - d)^2$. Note that under our ansatz this part
of the spectrum is bounded from above by the value $\lambda = 1$. In panel (a) of
Fig. 1 we show the low part of the spectral density of a one factor model describing
$N = 2000$ variables. The $\gamma_i$ are distributed exponentially with
$\gamma = 0.25$. The line is the theoretical prediction based on our ansatz. The
agreement between data and the ansatz is quite good.
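
The statements about the spectrum of the model correlation matrix (before any noise
dressing) are easy to verify numerically. The sketch below builds
$C = \mathrm{diag}(1 - \gamma_i^2) + \gamma\gamma^T$ and compares its eigenvalues
with the degenerate result and with the bounds implied by the ansatz. The helper
names and the uniform interval parameters `m` and `d` are illustrative assumptions.

```python
import numpy as np

def one_factor_correlation(gammas):
    """Model correlation matrix C = diag(1 - gamma_i^2) + gamma gamma^T."""
    g = np.asarray(gammas, dtype=float)
    return np.diag(1.0 - g**2) + np.outer(g, g)

# Degenerate case: gamma_i = gamma for every i (eigenvalues in ascending order).
N, gamma = 100, 0.25
spectrum = np.linalg.eigvalsh(one_factor_correlation(np.full(N, gamma)))

# Non degenerate case: gamma_i uniform in [m - d, m + d] (illustrative values).
rng = np.random.default_rng(2)
m, d = 0.25, 0.15
gammas = rng.uniform(m - d, m + d, size=500)
eig = np.linalg.eigvalsh(one_factor_correlation(gammas))
low_part, largest = eig[:-1], eig[-1]

def ansatz_density(lam, m, d):
    """P(lambda) = (4 d)^-1 (1 - lambda)^(-1/2) on the image of [m - d, m + d]."""
    lam = np.asarray(lam, dtype=float)
    lo, hi = 1.0 - (m + d) ** 2, 1.0 - (m - d) ** 2
    inside = (lam >= lo) & (lam <= hi)
    out = np.zeros_like(lam)
    out[inside] = 1.0 / (4.0 * d * np.sqrt(1.0 - lam[inside]))
    return out
```

In the non degenerate case the largest eigenvalue lies between $\sum_i \gamma_i^2$
and $1 + \sum_i \gamma_i^2$, and the remaining $N - 1$ eigenvalues stay below 1, as
the ansatz requires.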

We now consider the noise dressing of the spectrum due to the finiteness of the
sample. In other words, we assume that our data are described by a one factor model
(degenerate or not) and that we have $T$ synchronous records for each variable
$x_i$. By using the arguments of Ref. [8] we can prove that the ensemble averaged
resolvent of a one factor model of $N$ variables for $T$ time steps is determined by
the equation

$$G(z) = \frac{T}{z - \sum_{i=1}^{N} \frac{\lambda_i}{T - \lambda_i G(z)}}, \qquad (6)$$

where the $\lambda_i$ are the eigenvalues of the model correlation matrix $C$.
For the degenerate one factor model discussed above the
equation for the resolvent is a third degree algebraic equation [?], which can be
solved exactly. The spectral density can be obtained analytically from Eqs. (4) and
(6), even if the expression is quite long. As expected, the spectral density is
different from zero in two intervals, one for the $N - 1$ small eigenvalues and one
for the large eigenvalue $\lambda_1$. Numerical calculations and analytical
considerations show that the low part of the spectrum is well fitted by the
functional form of Eq. (2). Moreover, the widths of the two intervals scale with the
parameters of the model as $\Delta \sim (1 - \gamma^2)\sqrt{N/T}$ and
$\Delta_1 \sim N\gamma^2/\sqrt{T}$, where $\Delta$ ($\Delta_1$) is the width of the
low (high) part of the spectrum. Panel (b) of Fig. 1 shows the comparison between
the theoretical prediction and numerical simulations of a degenerate one factor
model. The agreement is very good over the whole range of eigenvalues. It is worth
noting that such an agreement is obtained also when $T < N$.
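
The degenerate case can be explored numerically with a few lines of code. The sketch
below is not the authors' implementation. It assumes that the resolvent equation has
the form $G(z) = T / \big(z - \sum_k m_k \lambda_k / (T - \lambda_k G(z))\big)$, with
$m_k$ the multiplicity of the distinct eigenvalue $\lambda_k$, which is consistent
with the polynomial degrees quoted in the text. It clears the denominators, finds the
polynomial roots at real $z = \lambda$, and uses the fact that inside the support two
roots form a complex conjugate pair. The function name `dressed_density` is
hypothetical.

```python
import numpy as np

def dressed_density(lam, eigenvalues, multiplicities, T):
    """Noise dressed spectral density at real lam, normalized to the number
    of eigenvalues, from G = T / (z - sum_k m_k ev_k / (T - ev_k G))."""
    # Each denominator (T - ev_k G) as a polynomial in G, highest power first.
    factors = [np.array([-ev, float(T)]) for ev in eigenvalues]
    product = np.array([1.0])
    for f in factors:
        product = np.polymul(product, f)
    # lam * prod_k (T - ev_k G) - sum_k m_k ev_k prod_{j != k} (T - ev_j G)
    bracket = lam * product
    for k, (ev, mult) in enumerate(zip(eigenvalues, multiplicities)):
        others = np.array([1.0])
        for j, f in enumerate(factors):
            if j != k:
                others = np.polymul(others, f)
        bracket = np.polysub(bracket, mult * ev * others)
    # Polynomial equation in G:  G * bracket - T * product = 0
    poly = np.polysub(np.polymul([1.0, 0.0], bracket), T * product)
    roots = np.roots(poly)
    # Real coefficients: all roots are real outside the support of the density.
    return max(roots.imag.max(), 0.0) / np.pi

# Degenerate one factor model with the parameters of Fig. 1(b).
N, T, gamma = 100, 500, 0.25
eigenvalues = [1.0 + (N - 1) * gamma**2, 1.0 - gamma**2]
multiplicities = [1, N - 1]
grid = np.linspace(0.05, 3.0, 3000)
low_part = np.array([dressed_density(x, eigenvalues, multiplicities, T) for x in grid])
```

With these parameters the polynomial is cubic, and `low_part` integrates to about
$N - 1$. The same function accepts any list of distinct eigenvalues, so it can also
be applied to the multifactor and hierarchical spectra under the same assumption.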

FIG. 2: Low part of the spectral density of a completely non degenerate one factor
model, i.e. a one factor model in which $\gamma_i$ is distributed uniformly between 0
and 1. The dashed line is the theoretical result obtained through the theory
developed in the text.

We come now to the more complicated case of the non degenerate one factor model. In
order to find the spectrum one should solve Eq. (5) for all the eigenvalues and then
solve the $(N + 1)$th degree polynomial of Eq. (6). This task is too complicated even
numerically. We use a different approach that exploits the fact that $N$ is large.
The sum in the denominator of Eq. (6) is split into a term for $\lambda_1$ plus a sum
over the remaining $N - 1$ small eigenvalues. This last term can be computed as
$N - 1$ times the average of $\lambda/(T - \lambda G(z))$ over the $P(\lambda)$
introduced in our
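ansatz. In other words, Eq. (6) for the resolvent becomes

$$G(z) = \frac{T}{z - \frac{\lambda_1}{T - \lambda_1 G(z)} - (N - 1) \left\langle \frac{\lambda}{T - \lambda G(z)} \right\rangle_{P(\lambda)}}, \qquad (7)$$

where $\lambda_1 \simeq 1 + (N - 1) \langle \gamma_i^2 \rangle_{ens}$. In general the average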
term in (7) is not a rational function, and therefore Eq. (7) cannot be reduced to an
algebraic equation in $G$. In order to solve the complex transcendental equation (7)
we introduce a simple algorithm. The average term in (7) typically depends on the
dispersion of $P(\gamma_i)$ only at second order. Therefore the low part of the
spectrum of a non degenerate one factor model with a small dispersion in $\gamma_i$
is not very different from that of a degenerate one factor model with
$\gamma_{eff} = \sqrt{\langle \gamma_i^2 \rangle_{ens}}$. We can therefore use the
value of the resolvent of this effective degenerate one factor model as the starting
point in the search for the solution of Eq. (7). Quite surprisingly, this method
also works when the dispersion of the $\gamma_i$ is high. For example, for a uniform
distribution of $\gamma_i$, the average term in Eq. (7) involves two inverse
hyperbolic tangent functions [13]. In Fig. 2 we show the low part of the spectrum for
a completely non degenerate one factor model, i.e. a model in which $\gamma_i$ is
uniformly distributed between 0 and 1. We see that the agreement between the theory
and the simulations is very good. A similarly good agreement is observed for
exponentially distributed $\gamma_i$ [13]. It is worth noting that in the general
case of a non degenerate one factor model the low part of the spectrum is not
compatible with the Wishart form of Eq. (2).
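
To illustrate where the inverse hyperbolic tangents come from, the average term can
be worked out for $\gamma$ uniform on $[0, 1]$. This is a sketch, not the derivation
of Ref. [13]: it assumes the ansatz $\lambda = 1 - \gamma^2$ and an averaged term of
the form $\lambda/(T - \lambda G)$, and it leaves aside the choice of branch for
complex $G$.

```latex
\begin{align}
\left\langle \frac{\lambda}{T - \lambda G} \right\rangle
  &= \int_0^1 \frac{1 - \gamma^2}{(T - G) + G\gamma^2}\, d\gamma
   = -\frac{1}{G} + \frac{T}{G} \int_0^1 \frac{d\gamma}{(T - G) + G\gamma^2} \\
  &= -\frac{1}{G} + \frac{T}{G\sqrt{G(T - G)}}\,
     \arctan\sqrt{\frac{G}{T - G}},
  \qquad \arctan x = -i\,\operatorname{artanh}(ix).
\end{align}
```

For a general interval $[m - d, m + d]$ the same antiderivative is evaluated at both
endpoints, which gives two such terms.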

The results obtained for the one factor model can be easily extended to multifactor
models. When the factors are stochastic and uncorrelated with each other, the
structure of the correlation matrix is determined by the composition of the groups
of variables corresponding to the factors. The simplest case is when each variable
belongs to one and only one group, i.e. its dynamics is determined by only one
factor and by the idiosyncratic noise. In this case the correlation matrix is block
diagonal. The correlation coefficient between variables belonging to different groups
is zero, while, when the variables $i$ and $j$ belong to the same group $k$, the
correlation coefficient is $\gamma_i^{(k)} \gamma_j^{(k)}$. The spectral density of
this kind of model is simply given by the superposition of the spectral densities of
$K$ one factor models. The theory of noise dressing follows directly from Eq. (6), in
which, for degenerate groups, the number of distinct eigenvalues is $2K$. The
equation for $G(z)$ is therefore an algebraic equation of degree $2K + 1$. Again, if
the distributions of the $\gamma_i^{(k)}$ are given, one can solve the non degenerate
case by using the same arguments as for the one factor model. Clearly the
computational task increases with the number of factors.
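
The superposition property can be checked on a small example. The sketch below
assumes degenerate groups (one weight $\gamma_k$ per group); the function names and
the group sizes are illustrative.

```python
import numpy as np

def block_one_factor_correlation(sizes, gammas):
    """Block diagonal correlation matrix: group k has n_k variables that all
    load on factor k with the same weight gamma_k (degenerate groups)."""
    N = sum(sizes)
    C = np.zeros((N, N))
    start = 0
    for n, g in zip(sizes, gammas):
        C[start:start + n, start:start + n] = g**2   # gamma_k * gamma_k inside group k
        start += n
    np.fill_diagonal(C, 1.0)
    return C

def expected_spectrum(sizes, gammas):
    """Union of the K one factor spectra, sorted in ascending order."""
    values = []
    for n, g in zip(sizes, gammas):
        values.append(1.0 + (n - 1) * g**2)          # large eigenvalue of group k
        values.extend([1.0 - g**2] * (n - 1))        # n_k - 1 degenerate eigenvalues
    return np.sort(np.array(values))

sizes, gammas = [30, 50, 20], [0.5, 0.3, 0.7]
numeric = np.linalg.eigvalsh(block_one_factor_correlation(sizes, gammas))
predicted = expected_spectrum(sizes, gammas)
```

Here `predicted` contains $2K = 6$ distinct values, which is the count that fixes
the degree $2K + 1$ of the equation for $G(z)$.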

An interesting generalization of multifactor models occurs when there is a
hierarchical overlap between different groups. To give a concrete example, let us
consider a portfolio of stocks. As a first approximation we can consider the
portfolio as composed of a large group following a common factor (the market factor
in CAPM) and a certain number of groups, homogeneous in economic activity, each
following a sectoral factor, such as, for example, the oil companies or the
technology group. In this case the composition of the groups induces a hierarchical
structure in the correlation matrix. This kind of model can also be solved. We
present here a simple example in which the $N$ variables follow a common factor with
a constant $\Gamma$. Moreover, the set of variables is divided into two groups. We
have $n_1$ variables following the first subfactor with constant $\gamma_1$ and
$n_2 = N - n_1$ variables following the second subfactor with constant $\gamma_2$.
The correlation matrix of this model is a block matrix composed of an
$n_1 \times n_1$ diagonal block whose entries are $\Gamma^2 + \gamma_1^2$, with 1 on
the diagonal, and an $n_2 \times n_2$ diagonal block whose entries are
$\Gamma^2 + \gamma_2^2$, with 1 on the diagonal. The elements of the off diagonal
blocks are equal to $\Gamma^2$.

The spectrum of this matrix is composed of two large eigenvalues given by

$$\lambda_\pm = \frac{1}{2} \Big( 2 + \gamma_1^2 (n_1 - 1) + \gamma_2^2 (n_2 - 1) + \Gamma^2 (n_1 + n_2 - 2) \pm \sqrt{\Delta^2 + \Gamma^4 (n_1 + n_2)^2 + 2 \Delta \Gamma^2 (n_1 - n_2)} \Big), \qquad (8)$$

where $\Delta = \gamma_1^2 (n_1 - 1) - \gamma_2^2 (n_2 - 1)$, together with
$n_1 - 1$ eigenvalues equal to $1 - \Gamma^2 - \gamma_1^2$ and $n_2 - 1$ eigenvalues
equal to $1 - \Gamma^2 - \gamma_2^2$. Again, by making use of Eq. (6) it is possible
to find the effect of noise dressing by solving the corresponding 5th degree
algebraic equation [13]. To give the general idea behind the derivation of result (8)
and of its generalization to more complex hierarchical models, we note that the
correlation matrix of degenerate hierarchical factor models can be written in terms
of the identity matrix and unit matrices $J_{mn}$ (i.e. $m \times n$ matrices
consisting of all 1s). These matrices form a closed algebra under multiplication, for
example $J_{nm} J_{mp} = m J_{np}$. The determinant leading to the characteristic
equation is directly obtained by using the block submatrices into which the
correlation matrix can be partitioned. The above mentioned algebraic properties allow
one to reduce the characteristic equation to terms of the form
$\det(a I_n + b J_{nn}) = a^{n-1} (a + bn)$.
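
Both the two large eigenvalues of the hierarchical example and the determinant
identity can be checked numerically. The sketch below uses illustrative parameter
values and hypothetical function names; `Gamma`, `g1` and `g2` stand for $\Gamma$,
$\gamma_1$ and $\gamma_2$.

```python
import numpy as np

def hierarchical_correlation(n1, n2, Gamma, g1, g2):
    """Two groups sharing a common factor (weight Gamma) plus one subfactor each."""
    C = np.full((n1 + n2, n1 + n2), Gamma**2)   # off diagonal blocks: Gamma^2
    C[:n1, :n1] += g1**2                        # first diagonal block
    C[n1:, n1:] += g2**2                        # second diagonal block
    np.fill_diagonal(C, 1.0)
    return C

def large_eigenvalues(n1, n2, Gamma, g1, g2):
    """The two large eigenvalues lambda_- and lambda_+ of result (8)."""
    delta = g1**2 * (n1 - 1) - g2**2 * (n2 - 1)
    mean = 2 + g1**2 * (n1 - 1) + g2**2 * (n2 - 1) + Gamma**2 * (n1 + n2 - 2)
    root = np.sqrt(delta**2 + Gamma**4 * (n1 + n2) ** 2
                   + 2 * delta * Gamma**2 * (n1 - n2))
    return 0.5 * (mean - root), 0.5 * (mean + root)

n1, n2, Gamma, g1, g2 = 40, 60, 0.3, 0.5, 0.4
eig = np.linalg.eigvalsh(hierarchical_correlation(n1, n2, Gamma, g1, g2))
lam_minus, lam_plus = large_eigenvalues(n1, n2, Gamma, g1, g2)

# Determinant identity det(a I_n + b J_nn) = a^(n-1) (a + b n).
a, b, n = 0.7, 0.2, 6
det_numeric = np.linalg.det(a * np.eye(n) + b * np.ones((n, n)))
det_formula = a ** (n - 1) * (a + b * n)
```

With these values the spectrum has four distinct eigenvalues, which is why the noise
dressing leads to a 5th degree equation.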

The last class of models we consider is a one factor model in which the
synchronization with the factor is in the frequency rather than in the time domain.
Specifically, let us consider a factor model in which the dynamics of the variables
is described by the equation

$$x_i(t) = \gamma \sqrt{2} \sin(\omega t + \phi_i) + \gamma^{(0)} \epsilon_i(t). \qquad (9)$$

The ensemble average of the product of two variables at two distinct instants of
time can be written as $\langle x_i(t) x_j(t') \rangle_{ens} = C_{ij} D_{tt'}$, where
$C_{ij} = \delta_{ij}$ and
$D_{tt'} = \gamma^2 \cos(\omega(t - t')) + (1 - \gamma^2) \delta_{tt'}$. The model
described in Eq. (9) is a first approximation of the dynamics of the level of
expression of genes during the cell cycle as detected in microarray experiments
[4,5]. In this case the frequency $\omega$ is related to the duration of the cell
cycle. Microarray experiments usually have a very small number of time points
compared with the number of variables (genes). This leads to a heavy dressing of the
correlation matrix by noise, and hence a careful characterization of the noise
dressing is even more important in this case. In order to find the effect of noise
dressing by using Random Matrix Theory we need to find the spectral density of the
matrices $C = (C_{ij})$ and $D = (D_{tt'})$ [8]. In this case $C$ is the identity
matrix. The spectrum of $D$ can be found, and it consists of two large eigenvalues

$$\lambda_\pm = \frac{\gamma^2}{2} \left( T \pm \frac{\sin(\omega T)}{\sin \omega} \right) + (1 - \gamma^2), \qquad (10)$$

and $T - 2$ eigenvalues equal to $\lambda_i = 1 - \gamma^2$, where
$i = 3, \ldots, T$. The equation for the resolvent is in this case
(cf. Eqs. (21)-(23) of Ref. [8])

where Q(z) is the solution of the equations 

T 

The equation for $G(z)$ is a 4th degree algebraic equation that can be solved
exactly [?].
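
The spectrum of the time correlation matrix
$D_{tt'} = \gamma^2 \cos(\omega(t - t')) + (1 - \gamma^2)\delta_{tt'}$ can be checked
numerically. The sketch below assumes that time is sampled at $t = 1, \ldots, T$ and
uses the absolute value of $\sin(\omega T)/\sin\omega$, so that the two large
eigenvalues are returned in ascending order; the names and parameter values are
illustrative.

```python
import numpy as np

def time_correlation_matrix(T, gamma, omega):
    """D_{tt'} = gamma^2 cos(omega (t - t')) + (1 - gamma^2) delta_{tt'}, t = 1..T."""
    t = np.arange(1, T + 1)
    D = gamma**2 * np.cos(omega * (t[:, None] - t[None, :]))
    D[np.diag_indices(T)] += 1.0 - gamma**2
    return D

def large_eigenvalues(T, gamma, omega):
    """The two large eigenvalues of D, smaller one first (requires sin(omega) != 0)."""
    ratio = abs(np.sin(omega * T) / np.sin(omega))
    base = 1.0 - gamma**2
    return (0.5 * gamma**2 * (T - ratio) + base,
            0.5 * gamma**2 * (T + ratio) + base)

T, gamma, omega = 20, 0.5, 0.7
eig = np.linalg.eigvalsh(time_correlation_matrix(T, gamma, omega))
lam_low, lam_high = large_eigenvalues(T, gamma, omega)
```

The remaining $T - 2$ entries of `eig` all equal $1 - \gamma^2$, so $D$ has three
distinct eigenvalues, in line with the 4th degree equation for $G(z)$.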

In conclusion, we have shown that the application of Random Matrix Theory allows one
to solve analytically the problem of the noise dressing of the spectral density of
the correlation matrix for a large class of factor models. This class includes one
factor models in the time and frequency domains and hierarchical and non
hierarchical multifactor models. The main idea of the approach we propose can be
summarized as follows: (i) find the spectrum of the degenerate factor model; this
usually amounts to finding the few distinct eigenvalues characterizing the spectrum;
(ii) find the effect of the noise dressing in the degenerate case by using Eq. (6);
this requires finding the roots of a low degree polynomial; (iii) try to solve the
non degenerate case by averaging over the distribution of the $\gamma$ parameters,
along the same lines as what has been done here for the one factor model (see
Eq. (7)). Our results can be applied in many different disciplines, including
economics, finance and molecular biology, and in general in any study in which
factor models constitute a good starting point for modeling the simultaneous
dynamics of many variables.